7 research outputs found

    Performance engineering for HEVC transform and quantization kernel on GPUs

    The continuous growth of video traffic and video services, especially for high-resolution and high-quality content, places heavy demands on video coding and its implementations. The High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor, H.264/AVC, at the cost of high computational complexity. To address this, high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification, designed for execution on a Graphics Processing Unit (GPU). The performance optimization strategy covers all three aspects of parallel design: exposing as much of the application's intrinsic parallelism as possible, exploiting high-throughput memory, and using instructions efficiently. It combines an efficient mapping of transform blocks to thread blocks with efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate the proposed implementation. The achieved processing times are 6.03 ms and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedup factors compared to CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times, respectively. The proposed implementation outperforms previous work by a factor of 1.22.
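To make the terms in this abstract concrete, the following is a minimal CPU-side sketch of a 4x4 HEVC-style forward transform, scalar quantization, and all-zero-block check for 8-bit video. It is not the paper's GPU kernel. The transform matrix and quantization multipliers are the standard HEVC integer values; the function names, the rounding order, and the intra/inter rounding offsets (following the conventions of the HM reference software) are assumptions made for illustration.

```python
# Illustrative sketch only: 4x4 HEVC-style transform + quantization with
# all-zero-block (AZB) detection for 8-bit video. Not the paper's GPU kernel.

# HEVC 4x4 core transform matrix (integer approximation of the DCT).
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]

# Forward quantization multipliers indexed by QP % 6.
QUANT_SCALES = [26214, 23302, 20560, 18396, 16384, 14564]

LOG2_SIZE = 2   # log2 of the transform size (4x4)
BIT_DEPTH = 8   # assumed sample bit depth


def _round_shift(value, shift):
    """Arithmetic right shift with a rounding offset (shift must be >= 1)."""
    return (value + (1 << (shift - 1))) >> shift


def forward_transform_4x4(residual):
    """Two-stage separable transform: rows first, then columns."""
    shift1 = LOG2_SIZE + BIT_DEPTH - 9
    shift2 = LOG2_SIZE + 6
    # Horizontal stage: tmp[i][k] = sum_j residual[i][j] * DCT4[k][j]
    tmp = [[_round_shift(sum(residual[i][j] * DCT4[k][j] for j in range(4)), shift1)
            for k in range(4)] for i in range(4)]
    # Vertical stage: coeff[k][c] = sum_i DCT4[k][i] * tmp[i][c]
    return [[_round_shift(sum(DCT4[k][i] * tmp[i][c] for i in range(4)), shift2)
             for c in range(4)] for k in range(4)]


def quantize_4x4(coeffs, qp, intra=True):
    """Scalar quantization with a fixed rounding offset (assumed HM-style)."""
    transform_shift = 15 - BIT_DEPTH - LOG2_SIZE
    qbits = 14 + qp // 6 + transform_shift
    offset = (171 if intra else 85) << (qbits - 9)
    scale = QUANT_SCALES[qp % 6]

    def quantize_one(c):
        level = (abs(c) * scale + offset) >> qbits
        return level if c >= 0 else -level

    return [[quantize_one(c) for c in row] for row in coeffs]


def is_all_zero_block(levels):
    """An AZB is a block whose quantized levels are all zero."""
    return all(level == 0 for row in levels for level in row)


if __name__ == "__main__":
    flat = [[1] * 4 for _ in range(4)]
    coeffs = forward_transform_4x4(flat)
    print(coeffs[0][0], is_all_zero_block(quantize_4x4(coeffs, qp=22)))
```

Identifying an AZB matters because such a block needs no further residual processing; a GPU kernel would typically evaluate many blocks like this in parallel, one or more per thread block.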

    Exploring manycore architectures for next-generation HPC systems through the MANGO approach

    The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.
    This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668.
    Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011

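    Sustav za pravovremeno videotraskodiranje na raznorodnim arhitekturama za računarstvo visokih performaci (System for just-in-time video transcoding on heterogeneous high-performance computing architectures)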

    Recent analyses show that 82% of global IP traffic will be video traffic by 2022. Handling this amount of data is a very challenging task for video content providers. The problem is compounded by the continuously growing number of different devices that can play video content: with such diversity, a single copy of a video cannot match the requirements of all playback conditions. Just-in-Time (JiT) video transcoding plays a key role in resolving these issues, but it is an extremely compute-intensive and resource-hungry process. The main goal of this doctoral dissertation was to investigate techniques for exploiting the coding information of an input video stream encoded with the HEVC standard in order to speed up re-encoding without a significant negative impact on video quality and/or coding efficiency. The thesis presents a novel algorithm for reusing coding information from the input video stream. The main concept behind the proposed algorithm is to estimate the computational complexity of re-encoding each coding block based on information retrieved from the decoded frame. The final goal is to achieve an optimal trade-off between the video quality of the transcoded bitstream and coding efficiency while conforming to strict timing requirements. To further improve the transcoding process, efficient implementations of JiT video transcoding on heterogeneous high-performance computing architectures were investigated, and a hardware accelerator for inter prediction, one of the key compute-intensive kernels, was designed and implemented. An integrated system composed of the implemented algorithm and the custom hardware accelerator is evaluated. Compared with other JiT transcoders, the proposed solution increases video quality by 0.945 dB and reduces bitrate by 35.06% on average. Compared with transcoders without timing requirements, speedups of up to 4 times are achieved, with average losses of 0.592 dB in PSNR and 21.74% in bitrate.

    Performance-efficient integration and programming approach of DCT accelerator for HEVC in MANGO platform

    Video encoding based on the HEVC standard is an extremely computationally expensive process, and achieving efficient encoding requires intelligent utilization of all available resources, from both a software and a hardware perspective. Profiling and analysis of the encoding process identified the discrete cosine transform (DCT) as one of the key kernels that consume most of the application's runtime. Therefore, a high-throughput, fully pipelined hardware accelerator was designed in FPGA and integrated into the MANGO platform. The MANGO platform is a heterogeneous HPC system that consists of different types of nodes, from general-purpose nodes (GNs) to heterogeneous nodes (HNs). While executing specific kernels on GNs is a straightforward process, executing kernels on accelerator-based HNs is a more complex procedure and requires specific integration to successfully exploit the heterogeneous architecture. This paper presents a performance-efficient integration of the DCT hardware accelerator in the MANGO platform, focusing on the performance of the encoder while maintaining the coding efficiency and video quality of the encoded bitstream. Several approaches were considered, tested and compared, from standalone integration, where a series of single tasks was offloaded to the DCT accelerator, to more complex solutions based on smart buffer utilization.

    Dynamic load balancing algorithm based on HEVC tiles for just-in-time video encoding for heterogeneous architectures

    This paper proposes a novel algorithm for dynamic tile partitioning to achieve the optimal workload balance for parallel processing architectures in just-in-time HEVC encoding. Tile boundaries are dynamically shifted depending on the tile cost, a value that denotes the predicted computational complexity of a single tile in a frame. The overall cost of a tile is determined as a combination of the costs of the three most computationally expensive and resource-hungry operations in HEVC encoding: prediction, transformation, and entropy coding. The algorithm aims at exploiting different types of processing architectures, from homogeneous multicore CPU architectures to heterogeneous architectures, under the actual conditions in which streaming servers operate. The experimental results show that the proposed algorithm outperforms uniform tiling by up to 5.5% in processing time while maintaining the same video quality and bitrate. Compared to state-of-the-art algorithms, the proposed algorithm achieves up to 8.85% speedup, depending on the number of videos that are being encoded concurrently on a video streaming server.
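The idea of shifting tile boundaries according to predicted cost can be illustrated with a small sketch. This is not the algorithm from the paper: it is a simple greedy prefix-sum heuristic, and the per-column cost values, the function names, and the restriction to vertical tile boundaries are all assumptions made for illustration.

```python
# Illustrative sketch only: cost-balanced placement of vertical tile boundaries.
# A greedy prefix-sum heuristic, not the algorithm proposed in the paper.

def balance_tile_columns(column_costs, num_tiles):
    """Return boundary indices so that tiles have roughly equal predicted cost.

    column_costs: hypothetical predicted cost of each CTU column in the frame.
    num_tiles:    number of tile columns (e.g. one per worker).
    A boundary value b means a new tile starts at column index b.
    """
    if not 1 <= num_tiles <= len(column_costs):
        raise ValueError("num_tiles must be between 1 and the number of columns")
    target = sum(column_costs) / num_tiles  # ideal cost per tile
    boundaries = []
    running = 0.0
    for idx, cost in enumerate(column_costs):
        remaining_tiles = num_tiles - len(boundaries) - 1
        if remaining_tiles == 0:
            break  # the last tile takes all remaining columns
        running += cost
        remaining_cols = len(column_costs) - (idx + 1)
        reached_target = running >= target * (len(boundaries) + 1)
        must_cut = remaining_cols == remaining_tiles  # keep every tile non-empty
        if reached_target or must_cut:
            boundaries.append(idx + 1)
    return boundaries


def tile_costs(column_costs, boundaries):
    """Sum the predicted cost of each tile given the boundary indices."""
    edges = [0] + list(boundaries) + [len(column_costs)]
    return [sum(column_costs[a:b]) for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    costs = [3, 3, 1, 1, 1, 1, 1, 1]   # two expensive columns on the left
    cuts = balance_tile_columns(costs, 2)
    print(cuts, tile_costs(costs, cuts))
    print(tile_costs(costs, [4]))      # uniform split for comparison
```

With uniform tiling the slower tile dominates the frame time, so moving the boundary toward the expensive region shortens the critical path. Re-running the heuristic per frame with updated cost predictions gives the dynamic behaviour the abstract describes in general terms.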

    Highly parallel GPU accelerator for HEVC transform and quantization

    An analysis of today's Internet traffic shows that digital video content prevails. Its share will continue to grow in the coming years and reach 82% of all traffic by 2021. Expressed in Internet video minutes, this equals about one million minutes of video per second. Video processing devices are therefore expected to provide and support improved compression capability. This will relieve the pressure on storage systems and communication networks while creating the preconditions for further development of video services. Transform and quantization is one of the most compute-intensive parts of modern hybrid video coding systems, in which the coding algorithm itself is commonly standardized. High Efficiency Video Coding (HEVC) is a state-of-the-art video coding standard that achieves high compression efficiency at the cost of high computational complexity. In this paper, we present a highly parallel GPU accelerator for HEVC transform and quantization that targets the most common heterogeneous computing system, CPU+GPU. The accelerator is implemented using the CUDA programming model. All relevant state-of-the-art techniques related to kernel vectorization, shared memory optimization and overlapping data transfers with computation were investigated, customized and carefully combined to obtain a performance-efficient solution across all applicable transform sizes. The proposed solution is compared against a reference implementation that uses the NVIDIA cuBLAS library to perform the same work. The speedup factors obtained for a DCI 4K frame are 2.46 times for the largest transform size and 130.17 times for the smallest transform size, which reveals a substantial performance gap of this library when targeting a GPU of the Kepler architecture. The achieved processing time for transform and quantization of a frame is up to 4.82 ms.